Cleaning The Data

Some sort of introductory text to what cleaning data even means.

1. Setup

First, we need to load the packages that are going to help facilitate the data analysis. You don’t need to worry about what each package does, but this is an important setup step that allows us to use specific functions later on. I always include tidyverse, janitor and lubridate at the top of every R file.

If you want to learn more about what each of those packages does,

Tip

Add a “code chunk” to your file by hitting Command+Option+i on the keyboard. You’ll run all of your code within these chunks, so get used to this keyboard shortcut!

Installing the packages

Loading them in the R file

Copy and paste the code below and “run” it by hitting the green play button in the top right of your code chunk.

# setup 
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(lubridate)
Tip

You’ll see some sort of message pop up after you run the code; as long as there’s no red “error” warning, you’re fine to move on.

2. Importing the data from LegiScan

Now you need to take the data that we downloaded to your computer and import it into this R file. It sounds complicated, but all it takes is a simple line of code.

2.1 Create your data folders

If you open the files we downloaded from LegiScan in the previous step, you’ll see it actually includes seven different data tables. For the purposes of this project, we’re only going to be working with three of those: “bills,” “sponsors,” and “people.”

Create a new code chunk using the keyboard shortcut from above. Copy and run the code below to import those three tables into this file.

 bills <- read_csv("data-raw/tx-bills-all/2025-2026_89th_Legislature/csv/bills.csv")
Rows: 9897 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): bill_number, status_desc, title, description, committee, last_acti...
dbl  (4): bill_id, session_id, status, committee_id
date (2): status_date, last_action_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
 bills
sponsors <- read_csv("data-raw/tx-bills-all/2025-2026_89th_Legislature/csv/sponsors.csv")
Rows: 14190 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): bill_id, people_id, position

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sponsors
lawmakers <- read_csv("data-raw/tx-bills-all/2025-2026_89th_Legislature/csv/people.csv")
Rows: 181 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, first_name, middle_name, last_name, suffix, nickname, party,...
dbl  (7): people_id, party_id, role_id, followthemoney_eid, votesmart_id, kn...
lgl  (1): opensecrets_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
lawmakers

3. “Cleaning” those tables

Oftentimes, data tables aren’t in the best shape when you download them. There might be rows or cells that are missing entries, or there may be columns that aren’t necessary for the analysis you want to do. This is where cleaning comes in: you can use code to adjust the table/remove data you don’t need so the final table is much easier to work with.

3.1 Bills

Let’s clean up each of the data tables one at a time. We’ll start with our “bills” data.

A) View the raw data

Before we start cleaning up the tables, let’s take another look at what they look like untouched. Run this code to view the table you just imported.

bills 

add description of what we’re looking at here

B) Remove unnecessary columns and rename columns
bills_clean <- bills |> 
   select(-session_id, -url, -state_link) 

bills_clean 
# add this to analysis file
bills_clean |> 
  group_by(status, status_desc) |> 
  summarise(status_type = n()) |> 
  arrange(desc(status_type))
`summarise()` has grouped output by 'status'. You can override using the
`.groups` argument.

Describe what we’re looking at now.

3.2 Lawmakers

A) View the raw data
lawmakers 

Description of what we’ve got here.

B) Remove unnecessary columns and rename columns
lawmakers_clean <- lawmakers |> 
  select(people_id, name, middle_name, suffix, party, role, district) 

lawmakers_clean

Description of what we’re working with now.

3.3 Sponsors

A) View the raw data
sponsors
B) Remove unnecessary columns and rename columns
sponsors_clean <- sponsors |> 
  rename(spons_position = position)

sponsors_clean

Restate all the changes made.

4. Merge the tables together

Now that we’ve selected the columns we want to work with in each of the three data tables, we need to merge everything together into one big table that we’ll use for our analysis. You’ll notice there are a couple of repeat columns in the tables, such as people_id and bill_id. We’ll want to make sure those only appear once in our big table. Let’s do this in two steps: first, copy the code below to merge the lawmaker and sponsor tables together.

law_spons <- merge(lawmakers_clean, sponsors_clean) 

law_spons

Now lets add the description column and date_filed column from the bills table.

final_table <- law_spons |> 
  left_join(bills_clean) |> 
  rename(mid_init = middle_name) 
Joining with `by = join_by(bill_id)`

Great! You should have one big data table with the following information:

  • people_id = the unique number associated with each lawmaker

  • name = lawmaker’s first and last name

  • mid_init = lawmaker’s middle initial

  • suffix = lawmaker’s suffix, if applicable

  • party = political party of each lawmaker

  • role = whether the lawmaker is a senator or a representative

  • district = the congressional district associated with each lawmaker

  • bill_id = the unique number associated with a specific bill filed this session

  • bill_number = standard bill naming in the legislature

  • spons_position = for a specific bill, this number indicates where a lawmaker is in the line of sponsor (ex: “1”=primary sponsor, “7”=seventh sponsor, etc.)

  • status = a numerical representation of where the bill currently sits (ex: “0”=no progress, “1”=Introduced, “2”=Engrossed, “4”=Passed) There is no 3, I don’t know why.

  • status_desc = written description of the status

  • status_date = date a specific bill achieved the corresponding status

  • title = shorter description of bill as written in LegiScan

  • description = longer description of bill as written in LegiScan

  • committee = which Senate/House committee the bill was referred to

  • committee_id = unique code associated with each committee

  • last_action = most recent update to bill as of the date you downloaded the data

  • last_action_date = calendar date of that update

This is a lot of information to work with! We may not end up using every column, but it’s still important to have a full picture of the data available. The table has over 14,000 rows of data, each corresponding with a bill/joint resolution. Now you can see the appeal of sorting through everything in R, rather than manually or in a spreadsheet.

Add callout note about companion bills

We’ll be able to quickly sort through all of this information and identify key details about the agendas of Texas lawmakers in the current legislative session. Let’s save this table before we move forward with our analysis.

5. Saving the clean table

Add instructions for creating a data-processed file.

final_table |> 
  write_rds("data-processed/bills-89.rds")

In the next chapter of this toolkit, we’ll brain storm some potential questions about the session and use the data to find some answers.